Provably Efficient Exploration


Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning

Neural Information Processing Systems

Motivated by the prevailing paradigm of using unsupervised learning for efficient exploration in reinforcement learning (RL) problems (Tang et al., 2017; Bellemare et al., 2016), we investigate when this paradigm is provably efficient. We study episodic Markov decision processes with rich observations generated from a small number of latent states. We present a general algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret tabular RL algorithm. Theoretically, we prove that as long as the unsupervised learning algorithm enjoys a polynomial sample complexity guarantee, we can find a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of observations. Empirically, we instantiate our framework on a class of hard exploration problems to demonstrate the practicality of our theory.
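The two-component framework the abstract describes can be sketched concretely. The toy below is an assumption-laden illustration, not the paper's algorithm: a hypothetical 3-latent-state environment emits noisy high-dimensional observations, plain k-means stands in for the unsupervised learning oracle, and epsilon-greedy Q-learning stands in for the no-regret tabular RL algorithm, operating entirely on decoded latent states:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy environment (hypothetical, for illustration only) ---
# 3 latent states; each observation is a noisy high-dimensional
# embedding of the current latent state.
N_LATENT, N_ACTIONS, OBS_DIM, HORIZON = 3, 2, 16, 5
centers = rng.normal(size=(N_LATENT, OBS_DIM))        # latent -> obs mean
P = rng.dirichlet(np.ones(N_LATENT), size=(N_LATENT, N_ACTIONS))
R = rng.uniform(size=(N_LATENT, N_ACTIONS))

def emit(s):
    """Rich observation generated from latent state s."""
    return centers[s] + 0.05 * rng.normal(size=OBS_DIM)

# --- Component 1: unsupervised "decoder" (plain k-means clustering) ---
def fit_decoder(observations, k=N_LATENT, iters=20):
    mu = observations[rng.choice(len(observations), k, replace=False)]
    for _ in range(iters):
        lbl = np.argmin(((observations[:, None] - mu) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (lbl == j).any():
                mu[j] = observations[lbl == j].mean(0)
    return lambda obs: int(np.argmin(((mu - obs) ** 2).sum(-1)))

# Warm-up: collect observations with random actions, fit the decoder.
obs_buf, s = [], 0
for _ in range(300):
    a = rng.integers(N_ACTIONS)
    obs_buf.append(emit(s))
    s = rng.choice(N_LATENT, p=P[s, a])
decode = fit_decoder(np.array(obs_buf))

# --- Component 2: tabular RL on the decoded (small) state space ---
Q = np.zeros((N_LATENT, N_ACTIONS))
for ep in range(500):
    s = 0
    for h in range(HORIZON):
        z = decode(emit(s))               # decode observation -> latent id
        a = rng.integers(N_ACTIONS) if rng.random() < 0.1 else int(Q[z].argmax())
        r = R[s, a]
        s2 = rng.choice(N_LATENT, p=P[s, a])
        z2 = decode(emit(s2))
        Q[z, a] += 0.1 * (r + Q[z2].max() - Q[z, a])
        s = s2

print(Q.shape)  # the learned table lives in latent space, not observation space
```

The point of the sketch is the size of `Q`: it scales with the number of latent states (3), not with the continuous 16-dimensional observation space, mirroring the sample-complexity claim.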


Review for NeurIPS paper: Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning

Neural Information Processing Systems

Additional Feedback: This paper introduces a method for efficient exploration in RL. The proposed method assumes an MDP with high-dimensional states that are generated by an underlying lower-dimensional process, such that these states can be compressed via an unsupervised learning algorithm/oracle. The method then (1) defines an MDP over the resulting low-dimensional state space; and (2) learns a policy by generating trajectories in the low-dimensional space, which arguably facilitates exploration. At each iteration, the algorithm gathers data to compute a policy and also to improve the embedding model computed by the unsupervised algorithm. The authors show that as long as the unsupervised algorithm and the tabular RL algorithm have polynomial sample complexity, it is possible to find a near-optimal policy with sample complexity polynomial in the number of latent states, which is much smaller than the number of high-dimensional states.


Review for NeurIPS paper: Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning

Neural Information Processing Systems

The paper focuses on efficiently exploring MDPs with high-dimensional state representations by combining an unsupervised algorithm that learns a low-dimensional representation with a tabular algorithm that solves the problem in this low-dimensional space. The paper is largely theoretical and shows that, under certain conditions, near-optimal policies can be found with sample complexity polynomial in the number of latent states. The reviewers mostly agreed on the following points. The paper is considered well-written, and presents theoretically strong results that are sound, novel, and non-trivial. As weaknesses, the reviewers mentioned the lack of empirical results in more realistic settings and the restrictive assumptions.


Provably Efficient Exploration in Constrained Reinforcement Learning: Posterior Sampling Is All You Need

Provodin, Danil, Gajane, Pratik, Pechenizkiy, Mykola, Kaptein, Maurits

arXiv.org Artificial Intelligence

We present a new algorithm based on posterior sampling for learning in constrained Markov decision processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms. Our main theoretical result is a Bayesian regret bound for each cost component of Õ(HS√(AT)) for any communicating CMDP with S states, A actions, and bound on the hitting time H. This regret bound matches the lower bound in order of time horizon T and is the best-known regret bound for communicating CMDPs in the infinite-horizon undiscounted setting. Empirical results show that, despite its simplicity, our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.
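The posterior-sampling principle itself is easy to sketch in the plain tabular case. The toy below illustrates only that principle, in a finite-horizon episodic MDP with known rewards and no constraints; the paper's algorithm additionally handles cost constraints and the infinite-horizon undiscounted setting, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H, EPISODES = 4, 2, 6, 200

# Hypothetical true MDP (unknown to the learner).
P_true = rng.dirichlet(np.ones(S), size=(S, A))
R_true = rng.uniform(size=(S, A))

# Dirichlet posterior over transition rows, tracked via visit counts.
counts = np.ones((S, A, S))              # Dirichlet(1, ..., 1) prior

def plan(P, R):
    """Finite-horizon value iteration on a sampled model."""
    pi = np.zeros((H, S), dtype=int)
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = R + P @ V                    # Q[s,a] = R[s,a] + sum_s' P[s,a,s'] V[s']
        pi[h] = Q.argmax(1)
        V = Q.max(1)
    return pi

for ep in range(EPISODES):
    # 1. Sample one model from the posterior (the Thompson-sampling step).
    P_hat = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                      for s in range(S)])
    # 2. Act greedily w.r.t. the sampled model for one episode.
    pi = plan(P_hat, R_true)             # rewards assumed known for brevity
    s = 0
    for h in range(H):
        a = pi[h, s]
        s2 = rng.choice(S, p=P_true[s, a])
        counts[s, a, s2] += 1            # 3. Posterior update from the transition.
        s = s2
```

Sampling a single model per episode is what keeps the method simple relative to optimism-based algorithms: exploration comes from posterior variance rather than from explicitly constructed confidence bonuses.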


r/MachineLearning - [R] Provably Efficient Exploration in Policy Optimization

#artificialintelligence

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves O(√(d³H³T)) regret. Here d is the feature dimension, H is the episode horizon, and T is the total number of steps.
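Under linear function approximation, the "optimistic" ingredient is typically an elliptical-confidence bonus added to a ridge-regression value estimate. The sketch below shows that bonus in isolation, on made-up feature data; it is an LSVI-style illustration, not OPPO itself, whose full algorithm couples such optimistic value estimates with a mirror-descent policy update that is not shown here:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, beta, lam = 4, 50, 1.0, 1.0

# Hypothetical past data: phi[i] = d-dimensional features of the i-th
# visited (state, action) pair, y[i] = its regression target.
phi = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ridge regression for the value weights, as in LSVI-style evaluation.
Lambda = lam * np.eye(d) + phi.T @ phi   # regularized Gram matrix
w = np.linalg.solve(Lambda, phi.T @ y)

def optimistic_q(feat):
    """Estimated value plus an elliptical-confidence exploration bonus:
    beta * sqrt(feat^T Lambda^{-1} feat)."""
    bonus = beta * np.sqrt(feat @ np.linalg.solve(Lambda, feat))
    return feat @ w + bonus

f = rng.normal(size=d)
assert optimistic_q(f) >= f @ w          # optimism: never below the plain estimate
```

The bonus shrinks for feature directions the data has covered well (large eigenvalues of `Lambda`) and stays large in unexplored directions, which is what drives exploration in this family of algorithms.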